NetNews Offline 2

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Offline 2 / NetNews Offline Volume 2.iso / news / comp / lang / c-part2 / 13590 < prev next >

Wrap

Text File | 1996-08-05 | 11.0 KB | 362 lines

Path: solon.com!not-for-mail From: Chris Torek <torek@elf.bsdi.com> Newsgroups: comp.lang.c,comp.lang.c.moderated Subject: Re: C coding problem Date: 8 Apr 1996 22:46:58 -0500 Organization: Berkeley Software Design, Inc. Sender: clc@solutions.solon.com Approved: clc@solutions.solon.com Message-ID: <4kcmji$p13@solutions.solon.com> References: <4j06na$808@solutions.solon.com> <4k1qh3$5hn@solutions.solon.com> <4k5vrk$a2d@solutions.solon.com> <4kbdq6$ec2@solutions.solon.com> Reply-To: torek@bsdi.com NNTP-Posting-Host: solutions.solon.com (I have removed a bunch of HP-specific groups.) Someone wrote: >>>In such and environment, >>> *q++ = *p++; >>>[can be slower than] >>> q[i] = p[i], ++i; In article <4k5vrk$a2d@solutions.solon.com> schwarz@mips.complang.tuwien.ac.at (Konrad Schwarz) wrote: >>Note that indexed addressing often works only for small powers of >>two, e.g., x86 family. In article <4kbdq6$ec2@solutions.solon.com> Verne Arase <VArase@varase.it.luc.edu> wrote: >Why only small powers of two? You mean if the size of each element is a >small power of two (so you can use shift rather than multiply)? The reason that it generally works `only for small powers of two' for such systems is that the hardware indexed-addressing modes only support the hardware's datatypes, which typically use 1, 2, 4, and 8 8-bit bytes. This is called `scaled indexed addressing': if p is a `long *', and long is 8 bytes, p[i] might use the `8 byte scaled index' mode to multiply i by 8 before adding it to p, all in one instruction. (Incidentally, on machines with rapid scaled index addressing and `load effective address' instructions, these can often be used to synthesize multiplication by values such as 3, 5, and 9 in a single instruction. GCC will do this for you when possible. This turns out to be useful more often you might expect: calculations of the form `x * 10' are common, and 10 is 2*5, and x*5 is `lea x,x[x,4]'.) >AFAIK, indexed addressing works with just about all CISC computers, from >the x86 and 68K line through S/360/370/390. There are CISCs that lack indexed addressing, and RISCs that have it, and on some implementations of some CISC architectures, the scaled indexing addressing modes are slower than some alternatives. It is not generally possible to determine the fastest method of copying memory without a lot more information about the hardware. (For instance, the SPARC has indexed addressing, but not scaled indexed addressing, so that the index must be a byte offset.) In article <4k60p9$ahr@solutions.solon.com> falstaff@xs4all.nl disparaged code of the form: >void loserstrcpy(char p[],char q[]) >{ int i=0; > > while((q[i]=p[i])!=0) > i++; >} As it happens, however, a literal translation of this code to SPARC assembly performs better than a literal translation of the pointer version. Finally, I present here some code I wrote and tested specifically for a `kernel bcopy' implementation (analagous to the Standard's memcpy---i.e., does not handle overlap---but lacks a return value). This is the fastest method I have found yet for the SPARC. Note that it uses both indexing and pointer increments, all carefully chosen based on the target instruction set. The code is byte-order dependent and `tuned' for gcc (and yet still requires some minor hand-tuning afterward, as gcc creates some unnecessary pointers within the unrolled inner loops). The fancy alignment network at the end, while complex, is faster than byte-at-a-time copies, on the machines I tested. (Note that the alignment network can be recoded for little-endian machines, even those that do not impose alignment constraints in hardware. On the other hand, many of those have specialized memory-copy instructions (e.g., VAX, x86), are register-poor (x86), have different relative CPU-to-memory bandwidth, or have some or all of these and other factors that make it unprofitable. The MIPS Rx000, which is either-endian and has hardware alignment constriaints, has special LWL and LWR instructions for building the left and right parts, so that this should not be attempted directly in C.) /* prototype bcopy, for turning into assembly... */ void bcopy(void *src0, void *dst0, unsigned int len) { char *src = src0, *dst = dst0; unsigned int n, i; unsigned int lp, rp; /* left and right partial words */ i = 0; n = (int)src | (int)dst; /* * Handle small copies separately. We *must* handle 6 or fewer * bytes here as the fancier code may copy 7 without looking at * the length, but we might as well take care of anything under 8. */ if (len < 8) { /* * Too small to fuss much, but handle 4-aligned-bytes * quickly. */ if (len >= 4 && (n & 3) == 0) { *(int *)dst = *(int *)src; i = 4; } while (i < len) dst[i] = src[i], i++; return; } /* * We have at least 8 bytes to copy. Handle easy doubleword * copies (everything perfectly aligned) first. */ if ((n & 7) == 0) goto copy_doublewords; /* * No matter which algorithm we choose, the destination must be * int-aligned. If it is not even short-aligned, fix that with * a one-byte copy now, then choose an alignment algorithm. */ n = (int)src ^ (int)dst; if ((int)dst & 1) *dst++ = *src++, len--; if ((n & 7) == 0) { /* * Align to doubleword, then copy doublewords. Move a * short and/or an int first, if necessary. This plus the * byte we might have copied above means that len had * better be at least 7. */ if ((int)dst & 2) *(short *)dst = *(short *)src, dst += 2, src += 2, len -= 2; if ((int)dst & 4) *(int *)dst = *(int *)src, dst += 4, src += 4, len -= 4; copy_doublewords: /* Copy via ldd/std, with an unrolled inner loop. */ n = len / 8 / 8; if (n) { len -= n * 8 * 8; #define CP(k) ((long long *)dst)[k] = ((long long *)src)[k] do { CP(0); CP(1); CP(2); CP(3); CP(4); CP(5); CP(6); CP(7); dst += 8 * 8, src += 8 * 8; } while (--n > 0); #undef CP } /* Handle any trailing `long long's. */ n = len & ~7; while (i < n) *(long long *)&dst[i] = *(long long *)&src[i], i += 8; /* Mop up trailing int, short, and/or byte. */ if ((n = len - i) == 0) return; if (n >= 4) { *(int *)&dst[i] = *(int *)&src[i], i += 4; goto trailing_check; } goto trailing_123; } if ((n & 3) == 0) { /* * Align to int, then copy ints. This is just like the * doubleword loop, but moves only four bytes at a crack. */ if ((int)dst & 2) *(short *)dst = *(short *)src, dst += 2, src += 2, len -= 2; n = len / 4 / 8; if (n) { len -= n * 4 * 8; #define CP(k) ((int *)dst)[k] = ((int *)src)[k] do { CP(0); CP(1); CP(2); CP(3); CP(4); CP(5); CP(6); CP(7); dst += 4 * 8, src += 4 * 8; } while (--n > 0); #undef CP } n = len & ~3; while (i < n) *(int *)&dst[i] = *(int *)&src[i], i += 4; trailing_check: if ((n = len - i) == 0) return; trailing_123: if (n >= 2) { /* 2 or 3 bytes left */ *(short *)&dst[i] = *(short *)&src[i]; if (n == 2) return; i += 2; } dst[i] = src[i]; return; } /* * Cannot even int-align, but still have at least 7 bytes to copy, * and dst is at least short-aligned. Copy up to two more bytes * to make it int-aligned, leaving at least 5 bytes to go. This * will leave src unaligned with a 1, 2, or 3-byte offset. * * This code is rather complex; see the pictures below for operation. */ if ((int)dst & 3) { *dst++ = *src++; len--; if ((int)dst & 3) { *dst++ = *src++; len--; } } n = len / 4; /* n >= 1 */ if (((int)src & 3) == 1) { /* * src dst * +---------------+ +---------------+ * | | 0 | 1 | 2 | | 0 | 1 | 2 | 3 | * +---------------+ +---------------+ * | 3 | . | . | . | | . | . | . | . | * +---------------+ +---------------+ * | . | w | x | y | | w | x | y | z | * +---------------+ +---------------+ * | z | ? | ? | ? | | ? | ? | ? | | * +---------------+ +---------------+ * * (where `.'s repeat as needed and `z' is the last of * the fullword-at-dst bytes). The `empty' slots always * exist at src, but need not exist at dst, and we may * may need to write none, one, two, or all three of the * bytes marked `?' (depending on the leftover length). */ /* Pick up the first three bytes, discarding -1st byte. */ lp = *(int *)&src[-1] << 8, src += 3; do { /* lp contains 3 `left hand' bytes */ rp = *(int *)&src[i]; /* now rp has `right hand' byte in top nibble */ *(int *)&dst[i] = lp | (rp >> 24); lp = rp << 8; i += 4; } while (--n > 0); /* Up to 3 bytes remain; they are available in lp. */ if ((len -= i) != 0) { if (len == 1) dst[i] = lp >> 24; else if (len == 2) *(short *)&dst[i] = lp >> 16; else { *(short *)&dst[i] = lp >> 16; dst[i + 2] = lp >> 8; } } } else if (((int)src & 3) == 2) { /* * src dst * +---------------+ +---------------+ * | | | 0 | 1 | | 0 | 1 | 2 | 3 | * +---------------+ +---------------+ * | 2 | 3 | . | . | | . | . | . | . | * +---------------+ +---------------+ * | . | . | w | x | | w | x | y | z | * +---------------+ +---------------+ * | y | z | ? | ? | | ? | ? | ? | + * +---------------+ +---------------+ * | ? | * +---+ * * Note that this time we may have to fetch one last * byte (the third `?') from src when the full-word * loop is done. */ /* Pick up the first two bytes. */ lp = *(unsigned short *)&src[0] << 16, src += 2; do { /* lp contains 2 `left hand' bytes */ rp = *(int *)&src[i]; /* now rp has two `right hand' bytes */ *(int *)&dst[i] = lp | (rp >> 16); lp = rp << 16; i += 4; } while (--n > 0); /* Up to 3 bytes remain, but only 2 are in lp. */ if ((len -= i) != 0) { if (len == 1) dst[i] = lp >> 24; else if (len == 2) *(short *)&dst[i] = lp >> 16; else { *(short *)&dst[i] = lp >> 16; rp = src[i]; i += 2; dst[i] = rp; } } } else /* (((int)src & 3) == 3) */ { /* * src dst * +---------------+ +---------------+ * | | | | 0 | | 0 | 1 | 2 | 3 | * +---------------+ +---------------+ * | 1 | 2 | 3 | . | | . | . | . | . | * +---------------+ +---------------+ * | . | . | . | w | | w | x | y | z | * +---------------+ +---------------+ * | x | y | z | ? | | ? | ? | ? | | * +---------------+ +---------------+ * | ? | ? | * +---+---+ * * Now we may have to read one or two last bytes from src * when the full-word loop is done. */ lp = *src++ << 24; do { rp = *(int *)&src[i]; /* now rp has three `right hand' bytes */ *(int *)&dst[i] = lp | (rp >> 8); lp = rp << 24; i += 4; } while (--n > 0); /* Up to 3 bytes remain, but only one is in lp. */ if ((len -= i) != 0) { if (len == 1) dst[i] = lp >> 24; else { rp = *(unsigned short *)&src[i]; lp |= rp << 8; *(short *)&dst[i] = lp >> 16; if (len == 3) { i += 2; dst[i] = rp /* lp >> 8 */; } } } } } -- In-Real-Life: Chris Torek, Berkeley Software Design Inc El Cerrito, CA Domain: torek@bsdi.com +1 510 234 3167